Data Visualization Techniques

Venustiano Soancatl Aguilar

Content

  • Preprocessing
    • Table formats
    • Normalization
  • Visualization techniques
    • Distributions
      • Histograms
      • Violin plots
      • Box plots
  • Preparing for publication

Wide format

  • Wide format: each repeated measure (here, each school subject) has its own column. Good for human-readable reports and some modeling functions.

Example (wide): each repeated measure has its own column

id  math  english
1   90    80
2   85    88

Long format

  • Long (tidy) format: one observation per row, each variable is a column (key-value pair for repeated measures). Preferred for tidy data workflows (dplyr/tidyr, pandas melt).

Example (long): one observation per row (id + subject -> score)

id  subject  score
1   math     90
2   math     85
1   english  80
2   english  88
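
The two layouts hold the same information and can be converted into each other. The following is a minimal pandas sketch on the toy table; the names `wide`, `long_df` and `back` are illustrative and not part of the course material.

```python
import pandas as pd

# Toy wide table: one column per school subject
wide = pd.DataFrame({"id": [1, 2], "math": [90, 85], "english": [80, 88]})

# Wide -> long: keep 'id' as the key and stack the subject columns into rows
long_df = wide.melt(id_vars="id", var_name="subject", value_name="score")
print(long_df)

# Long -> wide: spread the 'subject' values back into separate columns
back = long_df.pivot(index="id", columns="subject", values="score").reset_index()
back.columns.name = None  # drop the leftover 'subject' label on the column axis
print(back)
```

Note that `pivot` orders the new columns alphabetically, so `english` comes before `math` in `back`.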

Exergame dataset 1

First 10 rows of ../../data/exergamewf.csv (400x16), 40 participants
iSubj trial Age Decade medLrms51 medTI51 medK meanK medSpeed medTIms51 medLcovD51 medLsd51 medLcov51 medLrmslD51 medLsdD51 older
1 1 76 7 0.1894931 0.5895294 207.8009 333.8615 0.4887213 0.8063448 0.7447658 0.02060208 0.10868742 1.2591721 0.12906757 TRUE
1 2 76 7 0.2041523 0.6973378 223.2943 319.6015 0.5182781 0.7754507 0.5069439 0.01644558 0.07814028 1.2633661 0.10590245 TRUE
1 3 76 7 0.2173473 0.7861308 136.4008 272.4609 0.5841934 0.5119019 0.7061267 0.04737236 0.23646721 0.7827964 0.13391692 TRUE
1 4 76 7 0.1695694 1.0206954 162.0346 289.9279 0.5901514 0.4163981 0.7018032 0.03901695 0.26536092 0.4875698 0.10863750 TRUE
1 5 76 7 0.1707855 0.7697467 141.7677 300.6258 0.5517064 0.5500293 0.7531606 0.03679774 0.23590743 0.6461938 0.11960964 TRUE
1 6 76 7 0.1897457 0.8447487 135.1824 278.8230 0.5872821 0.4505710 0.6610873 0.04081446 0.24733843 0.5800137 0.11645546 TRUE
1 7 76 7 0.2233555 0.7575400 194.8239 300.0683 0.5530803 0.7030368 0.3478552 0.01441217 0.06038580 1.1837717 0.08177191 TRUE
1 8 76 7 0.2409097 0.7414884 226.1782 324.9164 0.5225434 0.7732494 0.2648009 0.01068888 0.04361652 1.4287875 0.06494999 TRUE
1 9 76 7 0.2347833 0.6954884 201.0556 300.1745 0.5300633 0.7637143 0.3943936 0.01528410 0.06477809 1.3693579 0.09403986 TRUE
1 10 76 7 0.2138546 0.6711597 203.4890 298.0097 0.5402634 0.7405751 0.4288368 0.01493881 0.06972877 1.2463371 0.09307257 TRUE

Converting exergamewf.csv into long format

  1. Launch mybinder
  2. Go to the ./material/data folder and create a new Python/R notebook
  3. Use the sparkle icon on the cell toolbar to show the inline chat popover
  4. Use the following prompts to generate the required code:
    1. Write python and R code to see the columns of the exergamewf.csv file.
    2. Convert the <exergamewf_data> into a data.table and create a long format table using the ‘iSubj’, ‘trial’, ‘Age’, ‘Decade’ and ‘older’ columns as keys.
    3. Plot a histogram in ggplot of the ‘value’ column of the <exergamewf_long> table.
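
For comparison with whatever the chat assistant generates, here is a rough Python sketch of the same three steps. It is an assumption-laden illustration: the helper names `wide_to_long` and `plot_value_histogram` are hypothetical, the tiny `wide_df` stands in for the real file (values rounded from the first two rows of the table shown earlier), and plotly is used for the histogram instead of ggplot.

```python
import pandas as pd

# Key columns named in the prompts
KEYS = ["iSubj", "trial", "Age", "Decade", "older"]


def wide_to_long(wide_df, keys):
    """Melt every non-key column into (variable, value) pairs."""
    return wide_df.melt(id_vars=list(keys), var_name="variable", value_name="value")


def plot_value_histogram(long_df):
    """Histogram of the 'value' column; plotly is imported only when called."""
    import plotly.express as px
    return px.histogram(long_df, x="value")


# Stand-in for pd.read_csv("exergamewf.csv"); the real path depends on
# where the notebook lives
wide_df = pd.DataFrame({
    "iSubj": [1, 1], "trial": [1, 2], "Age": [76, 76], "Decade": [7, 7],
    "medLrms51": [0.189, 0.204], "medTI51": [0.590, 0.697],
    "older": [True, True],
})

print(list(wide_df.columns))           # step 1: inspect the columns
long_df = wide_to_long(wide_df, KEYS)  # step 2: reshape to long format
print(long_df)
```

With the real data loaded, `plot_value_histogram(long_df).show()` would cover step 3. Because the measures live on very different scales, a histogram of the raw `value` column is dominated by the largest ones, which is the motivation for normalizing.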

Histogram

Normalization

Standard score:

\[x' = \frac{x-\mu}{\sigma}\]

Min-Max Feature scaling:

\[x' = \frac{x-\min(x)}{\max(x)-\min(x)}\]
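
Both formulas translate directly into pandas. The sketch below applies them within each variable of a long table so that measures on different scales become comparable. The data are illustrative, the function names are hypothetical, and two choices are assumptions: normalizing per variable, and using the population standard deviation for \(\sigma\).

```python
import pandas as pd


def standard_score(x):
    """x' = (x - mean) / sd, using the population sd (ddof=0)."""
    return (x - x.mean()) / x.std(ddof=0)


def min_max(x):
    """x' = (x - min) / (max - min), mapped onto [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


# Illustrative long table: two variables on very different scales
long_df = pd.DataFrame({
    "variable": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 100.0, 150.0, 300.0],
})

# Normalize within each variable, not across the whole column
grouped = long_df.groupby("variable")["value"]
long_df["z"] = grouped.transform(standard_score)
long_df["normVal"] = grouped.transform(min_max)
print(long_df)
```

A variable whose values are all identical makes both denominators zero and yields NaN, so constant columns need separate handling.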

After min-max normalization

Adding a distribution curve

Overlapping distributions

Ridgeline plot

Meet the violin plot

Packages and data

# Importing the packages
import plotly.express as px
import pandas as pd
# Loading and displaying the data
lf = pd.read_csv('../../data/exergamelf2.csv')
lf.head()
   idrow  Age  Decade  iSubj  trial     myVars     value   normVal  older
0      0   20       2     21      1  medLrms51  0.294961  0.879737  False
1      1   20       2     21      2  medLrms51  0.308855  0.926365  False
2      2   20       2     21      3  medLrms51  0.285579  0.848253  False
3      3   20       2     21      4  medLrms51  0.264111  0.776211  False
4      4   20       2     21      5  medLrms51  0.308285  0.924452  False
plot = px.violin(x=lf['myVars'], y=lf['normVal'])
plot.show()

Overlapping violin plots

plot = px.violin(x=lf['myVars'], y=lf['normVal'], color=lf['older'])
# Make the violins semi-transparent and give them an explicit width
plot.update_traces(opacity=0.6, width=0.8)
# Overlay the two groups instead of placing them side by side
plot.update_layout(violinmode='overlay')
# Optional: rotate the x labels when category names are long
plot.update_layout(xaxis_tickangle=45)
plot.show()

Preparing for publication

  • Remove unnecessary elements
    • Color background
    • Axis title
  • Select appropriate visual variables
    • color
  • Add appropriate labels
  • Save a high resolution image
  • Avoid chart junk

Colorbrewer

colorbrewer2.org

High-quality plots with LaTeX

  • TikZ

  • tikzDevice (only in R)

    scripts/FDV/Fig2_3_temperature_plot_tikz.ipynb

  • Useful for publications

  • Match plot and main text fonts

  • Can include complex mathematical formulae

  • tikzDevice works with R magic in Jupyter notebooks

A ggplot plot

library(ggplot2)

# long_dt: long-format table with columns variable, value_norm and older
p <- ggplot(long_dt, aes(x = variable, y = value_norm, fill = factor(older))) +
  geom_violin(position = position_identity(),   # overlay violins at same x
              alpha = 0.5, width = 0.8, trim = FALSE, color = NA) +
  # scale_y_continuous(limits = c(0, 1), expand = c(0, 0)) +
  labs(title = "Overlapping violins of per-variable normalized values",
       x = "variable",
       y = "value (normalized 0-1)",
       fill = "older") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

tikzDevice code

library(tikzDevice)   # assume installed

# 1) Write a standalone .tex that contains the TikZ picture
tikz("p_plot.tex", width = 6, height = 4, standAlone = TRUE)
print(p)   # print the ggplot object to the tikz device
dev.off()
# 2) Compile to PDF (requires a LaTeX installation)
# tools::texi2dvi() with pdf = TRUE writes p_plot.pdf
tools::texi2dvi("p_plot.tex", pdf = TRUE, clean = TRUE)

tikzDevice result